Positions and intervals provide addresses for anything that can be mapped to a reference genome:
It is no wonder that positions and intervals are key fields in most genomic data files.
For example, next-generation sequencing (NGS) technologies can map all kinds of functional biomolecular information along this coordinate system.
Show a grey line - genome - with some genes, orange intervals - like on the figure below Then show another track - let’s say H3K4me3 (just say that that you can do experiment where you can measure gene activity or something like that) And then pose a question - we would like to find a subset of active genes in this cell line or whatever Therefore - OVERLAP ! Then you say that this is the most basic/fundamental operation in genomics
The core objects in bioframe are pandas DataFrames of genomic intervals, or BedFrames defined by chrom, start and end columns (the names are configurable). We don't use opaque wrapper objects or DataFrame subclasses.
import bioframe as bf
df1 = pd.DataFrame([
['chr1', 1, 5],
['chr1', 3, 8],
['chr1', 8, 10],
['chr1', 12, 14]],
columns=['chrom', 'start', 'end']
)
display(df1)
bf.vis.plot_intervals(df1, show_coords=True, xlim=(0,16))
plt.title('bedFrame1 intervals');
| chrom | start | end | |
|---|---|---|---|
| 0 | chr1 | 1 | 5 |
| 1 | chr1 | 3 | 8 |
| 2 | chr1 | 8 | 10 |
| 3 | chr1 | 12 | 14 |
df2 = bf.from_any(
[['chr1', 4, 8],
['chr1', 10, 11]],
name_col='chrom'
)
display(df2)
bf.vis.plot_intervals(df2, show_coords=True, xlim=(0,16), colors='lightpink')
plt.title('bedFrame2 intervals');
| chrom | start | end | |
|---|---|---|---|
| 0 | chr1 | 4 | 8 |
| 1 | chr1 | 10 | 11 |
The most common operation is to calculate the overlaps between two sets of genomic intervals.
overlaps = bf.overlap(df1, df2, how='inner', suffixes=('_1','_2'))
overlaps
| chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | |
|---|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | chr1 | 4 | 8 |
| 1 | chr1 | 3 | 8 | chr1 | 4 | 8 |
df1 = pd.DataFrame([
['chr1', 1, 5],
['chr1', 3, 8],
['chr1', 8, 10],
['chr1', 12, 14],
],
columns=['chrom', 'start', 'end']
)
bf.vis.plot_intervals(df1, show_coords=True, xlim=(0,16))
df_annotated = bf.cluster(df1, min_dist=0)
display(df_annotated)
bf.vis.plot_intervals(df_annotated, labels=df_annotated['cluster'], show_coords=True, xlim=(0,16))
| chrom | start | end | cluster | cluster_start | cluster_end | |
|---|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | 0 | 1 | 10 |
| 1 | chr1 | 3 | 8 | 0 | 1 | 10 |
| 2 | chr1 | 8 | 10 | 0 | 1 | 10 |
| 3 | chr1 | 12 | 14 | 1 | 12 | 14 |
Instead of returning cluster assignments,bioframe.merge returns a new dataframe of merged genomic intervals. As with bioframe.cluster, using min_dist=0 and min_dist=None gives different results.
If min_dist=0, this returns a dataframe of two intervals:
df_merged = bf.merge(df1, min_dist=0)
display(df_merged)
bf.vis.plot_intervals(df_merged, show_coords=True, xlim=(0,16))
| chrom | start | end | n_intervals | |
|---|---|---|---|---|
| 0 | chr1 | 1 | 10 | 3 |
| 1 | chr1 | 12 | 14 | 1 |
If min_dist=None, this returns a dataframe of three intervals:
df_merged = bf.merge(df1, min_dist=None)
display(df_merged)
bf.vis.plot_intervals(df_merged, show_coords=True, xlim=(0,16))
| chrom | start | end | n_intervals | |
|---|---|---|---|---|
| 0 | chr1 | 1 | 8 | 2 |
| 1 | chr1 | 8 | 10 | 1 |
| 2 | chr1 | 12 | 14 | 1 |
It is often useful not only to find features that overlap, but also features that are nearby along the genome.
closest_intervals = bf.closest(df1, df2, suffixes=('_1','_2'))
display(closest_intervals)
| chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | distance | |
|---|---|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | chr1 | 4 | 8 | 0 |
| 1 | chr1 | 3 | 8 | chr1 | 4 | 8 | 0 |
| 2 | chr1 | 8 | 10 | chr1 | 4 | 8 | 0 |
| 3 | chr1 | 12 | 14 | chr1 | 10 | 11 | 1 |
By default, bioframe.closest reports overlapping intervals. This can be modified by passing ignore_overlap=True. Note the closest pair #2 and #3, which did not overlap, remain the same.
Closest intervals within a single DataFrame can be found simply by passing a single dataframe to bioframe.closest. The number of closest intervals to report per query interval can be adjusted with k.
bf.closest(df1, k=2)
| chrom | start | end | chrom_ | start_ | end_ | distance | |
|---|---|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | chr1 | 3 | 8 | 0 |
| 1 | chr1 | 1 | 5 | chr1 | 8 | 10 | 3 |
| 2 | chr1 | 3 | 8 | chr1 | 1 | 5 | 0 |
| 3 | chr1 | 3 | 8 | chr1 | 8 | 10 | 0 |
| 4 | chr1 | 8 | 10 | chr1 | 3 | 8 | 0 |
| 5 | chr1 | 8 | 10 | chr1 | 12 | 14 | 2 |
| 6 | chr1 | 12 | 14 | chr1 | 8 | 10 | 2 |
| 7 | chr1 | 12 | 14 | chr1 | 3 | 8 | 4 |
For two sets of genomic features, it is often useful to calculate the number of basepairs covered and the number of overlapping intervals. While these are fairly straightforward to compute from the output of bioframe.overlap with pandas.groupby and column renaming, since these are very frequently used, they are provided as core bioframe functions.
df1_coverage = bf.coverage(df1, df2)
display(df1_coverage)
| chrom | start | end | coverage | |
|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | 1 |
| 1 | chr1 | 3 | 8 | 4 |
| 2 | chr1 | 8 | 10 | 0 |
| 3 | chr1 | 12 | 14 | 0 |
df1_coverage_and_count = bf.count_overlaps(df1_coverage, df2)
display(df1_coverage_and_count)
| chrom | start | end | coverage | count | |
|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | 1 | 1 |
| 1 | chr1 | 3 | 8 | 4 | 1 |
| 2 | chr1 | 8 | 10 | 0 | 0 |
| 3 | chr1 | 12 | 14 | 0 | 0 |
bf.vis.plot_intervals(
df1_coverage_and_count,
show_coords=True, xlim=(0,16),
labels = [f'{cov} bp, {n} intervals'
for cov, n in zip(df1_coverage_and_count.coverage, df1_coverage_and_count['count'])])
bf.vis.plot_intervals(df2, show_coords=True, xlim=(0,16), colors='lightpink')
Bioframe has two functions for computing differences between sets of intervals: at the level of basepairs and at the level of whole intervals.
Basepair-level subtraction is performed with bioframe.subtract:
subtracted_intervals = bf.subtract(df1, df2)
display(subtracted_intervals)
| chrom | start | end | |
|---|---|---|---|
| 0 | chr1 | 1 | 4 |
| 1 | chr1 | 3 | 4 |
| 2 | chr1 | 8 | 10 |
| 3 | chr1 | 12 | 14 |
bf.vis.plot_intervals(subtracted_intervals, show_coords=True, xlim=(0,16))
Interval-level differences are calculated with bioframe.setdiff:
setdiff_intervals = bf.setdiff(df1, df2)
display(setdiff_intervals)
| chrom | start | end | |
|---|---|---|---|
| 2 | chr1 | 8 | 10 |
| 3 | chr1 | 12 | 14 |
bf.vis.plot_intervals(setdiff_intervals, show_coords=True, xlim=(0,16))
Equally important to finding which genomic features overlap is finding those that do not. bioframe.complement returns a BedFrame of intervals not covered by any intervals in an input BedFrame.
Complments are defined relative to a genomic view. Here this is provided as a dictionary with a single chromosome of length 15:
df_complemented = bf.complement(df1, view_df={'chr1':15})
display(df_complemented)
| chrom | start | end | view_region | |
|---|---|---|---|---|
| 0 | chr1 | 0 | 1 | chr1 |
| 1 | chr1 | 10 | 12 | chr1 |
| 2 | chr1 | 14 | 15 | chr1 |
bf.vis.plot_intervals(df_complemented, show_coords=True, xlim=(0,16), colors='lightpink')
If no view is provided, complement is calculated per unique chromosome in the input with right limits of np.iinfo(np.int64).max.
df_complemented = bf.complement(df1)
display(df_complemented)
| chrom | start | end | view_region | |
|---|---|---|---|---|
| 0 | chr1 | 0 | 1 | chr1 |
| 1 | chr1 | 10 | 12 | chr1 |
| 2 | chr1 | 14 | 9223372036854775807 | chr1 |
view_df = pd.DataFrame(
[
["chr1", 0, 4, "chr1p"],
["chr1", 5, 9, "chr1q"],
["chrX", 0, 50, "chrX"],
["chrM", 0, 10, "chrM"]],
columns=["chrom", "start", "end", "name"],
)
df_unsorted = pd.DataFrame([
['chrM', 3, 8],
['chrM', 1, 5],
['chrX', 12, 14],
['chrX', 8, 10]],
columns=['chrom', 'start', 'end']
)
bf.sort_bedframe(df_unsorted)
| chrom | start | end | |
|---|---|---|---|
| 0 | chrM | 1 | 5 |
| 1 | chrM | 3 | 8 |
| 2 | chrX | 8 | 10 |
| 3 | chrX | 12 | 14 |
bf.sort_bedframe(df_unsorted, view_df)
| chrom | start | end | |
|---|---|---|---|
| 0 | chrX | 8 | 10 |
| 1 | chrX | 12 | 14 |
| 2 | chrM | 1 | 5 |
| 3 | chrM | 3 | 8 |
bioframe.select(df_unsorted,'chrX:8-14')
| chrom | start | end | |
|---|---|---|---|
| 2 | chrX | 12 | 14 |
| 3 | chrX | 8 | 10 |
#import hg
#hg.Viewconf.from_url("https://resgen.io/api/v1/viewconfs/Y_omIrpERgG01VsqmtMLVA/?raw=1")
Genomic intervals are one of the most prevalent data structures in computational genome biology, and used to represent features ranging from genes, to DNA binding sites, to disease variants. Operations on genomic intervals provide a language for asking questions about relationships between features.
While there are excellent interval arithmetic tools for the command line, they are not smoothly integrated into Python, one of the most popular general-purpose computational and visualization environments.
Bioframe is a library to enable flexible and performant operations on genomic interval dataframes in Python. Bioframe extends the Python data science stack to use cases for computational genome biology by building directly on top of two of the most commonly-used Python libraries, numpy and pandas.
The bioframe API enables flexible name and column orders, and decouples operations from data formats to avoid unnecessary conversions, a common scourge for bioinformaticians. Bioframe achieves these goals while maintaining high performance and a rich set of features.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import bioframe
Toy examples with plots
Figure from paper
ctcf_peaks = bioframe.read_table(
"https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz",
schema='narrowPeak'
)
ctcf_peaks
| chrom | start | end | name | score | strand | fc | -log10p | -log10q | relSummit | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr19 | 48309541 | 48309911 | . | 1000 | . | 5.04924 | -1.0 | 0.00438 | 185 |
| 1 | chr4 | 130563716 | 130564086 | . | 993 | . | 5.05052 | -1.0 | 0.00432 | 185 |
| 2 | chr1 | 200622507 | 200622877 | . | 591 | . | 5.05489 | -1.0 | 0.00400 | 185 |
| 3 | chr5 | 112848447 | 112848817 | . | 869 | . | 5.05841 | -1.0 | 0.00441 | 185 |
| 4 | chr1 | 145960616 | 145960986 | . | 575 | . | 5.05955 | -1.0 | 0.00439 | 185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40582 | chr8 | 22574315 | 22574744 | . | 1000 | . | 561.11939 | -1.0 | 4.90268 | 243 |
| 40583 | chr15 | 56246029 | 56246402 | . | 1000 | . | 569.05663 | -1.0 | 4.90268 | 191 |
| 40584 | chr1 | 150979463 | 150979845 | . | 1000 | . | 580.28482 | -1.0 | 4.90268 | 194 |
| 40585 | chr16 | 57649040 | 57649402 | . | 1000 | . | 602.95266 | -1.0 | 4.90268 | 173 |
| 40586 | chr12 | 54379625 | 54380042 | . | 1000 | . | 627.60723 | -1.0 | 4.90268 | 203 |
40587 rows × 10 columns
jaspar_url = 'http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/hg38/'
jaspar_motif_file = 'MA0139.1.tsv.gz'
ctcf_motifs = bioframe.read_table(
jaspar_url + jaspar_motif_file,
schema='jaspar',
skiprows=1
)
ctcf_motifs
| chrom | start | end | name | score | pval | strand | |
|---|---|---|---|---|---|---|---|
| 0 | chr1 | 11163 | 11182 | CTCF | 811 | 406 | - |
| 1 | chr1 | 11222 | 11241 | CTCF | 959 | 804 | - |
| 2 | chr1 | 11280 | 11299 | CTCF | 939 | 728 | - |
| 3 | chr1 | 11339 | 11358 | CTCF | 837 | 455 | - |
| 4 | chr1 | 11401 | 11420 | CTCF | 829 | 439 | - |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 770044 | chrY_KI270740v1_random | 4600 | 4619 | CTCF | 815 | 412 | - |
| 770045 | chrY_KI270740v1_random | 5601 | 5620 | CTCF | 815 | 412 | - |
| 770046 | chrY_KI270740v1_random | 7601 | 7620 | CTCF | 815 | 412 | - |
| 770047 | chrY_KI270740v1_random | 8602 | 8621 | CTCF | 815 | 412 | - |
| 770048 | chrY_KI270740v1_random | 19753 | 19772 | CTCF | 806 | 395 | - |
770049 rows × 7 columns
df = bioframe.overlap(
ctcf_peaks,
ctcf_motifs,
# suffixes=('_1', ''),
return_index=True,
how='left',
)
motifs_per_peak = df.groupby(["index"])["index_"].count().values
bins = np.arange(0, np.max(motifs_per_peak))
counts, _ = np.histogram(motifs_per_peak, bins)
plt.bar(bins[:-1], height=counts, align='center')
plt.xlabel('number of overlapping motifs per peak')
plt.ylabel('number of peaks')
plt.semilogy();
print(f'fraction of peaks without motifs {np.round(np.sum(motifs_per_peak==0)/len(motifs_per_peak),2)}');
fraction of peaks without motifs 0.14
df
| index | chrom | start | end | name | score | strand | fc | -log10p | -log10q | relSummit | index_ | chrom_ | start_ | end_ | name_ | score_ | pval_ | strand_ | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | chr19 | 48309541 | 48309911 | . | 1000.0 | . | 5.04924 | -1.0 | 0.00438 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 1 | 1 | chr4 | 130563716 | 130564086 | . | 993.0 | . | 5.05052 | -1.0 | 0.00432 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 2 | 2 | chr1 | 200622507 | 200622877 | . | 591.0 | . | 5.05489 | -1.0 | 0.00400 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 3 | 3 | chr5 | 112848447 | 112848817 | . | 869.0 | . | 5.05841 | -1.0 | 0.00441 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 4 | 4 | chr1 | 145960616 | 145960986 | . | 575.0 | . | 5.05955 | -1.0 | 0.00439 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48626 | 40584 | chr1 | 150979463 | 150979845 | . | 1000.0 | . | 580.28482 | -1.0 | 4.90268 | 194.0 | 40407 | chr1 | 150979668 | 150979687 | CTCF | 925.0 | 678.0 | - |
| 48627 | 40585 | chr16 | 57649040 | 57649402 | . | 1000.0 | . | 602.95266 | -1.0 | 4.90268 | 173.0 | 261215 | chr16 | 57649185 | 57649204 | CTCF | 918.0 | 656.0 | - |
| 48628 | 40585 | chr16 | 57649040 | 57649402 | . | 1000.0 | . | 602.95266 | -1.0 | 4.90268 | 173.0 | 261216 | chr16 | 57649229 | 57649248 | CTCF | 829.0 | 439.0 | + |
| 48629 | 40586 | chr12 | 54379625 | 54380042 | . | 1000.0 | . | 627.60723 | -1.0 | 4.90268 | 203.0 | 153951 | chr12 | 54379782 | 54379801 | CTCF | 863.0 | 511.0 | - |
| 48630 | 40586 | chr12 | 54379625 | 54380042 | . | 1000.0 | . | 627.60723 | -1.0 | 4.90268 | 203.0 | 153952 | chr12 | 54379835 | 54379854 | CTCF | 885.0 | 564.0 | - |
48631 rows × 19 columns
%%bash
echo "Hello World"
which sed
Hello World /bin/sed
import hg
config = hg.Viewconf.from_url('https://higlass.io/api/v1/viewconf?d=default')
config